Introduction

The data source I am examining concerns life expectancy in 193 countries, described by a set of variables that range from economic conditions to disease prevalence and mortality rates. The data cover 2000 to 2015 for most countries, so they can show trends within a country as well as across countries. The scope is also intriguing to me: the data set covers nearly 200 countries, each categorized as developing or developed, and attempts to support broad judgements based on 20 variables. I am looking forward to analyzing the data along economic lines and seeing how expenditures (as found from the United Nations source) relate to life expectancy.

My cleaning process is outlined in data_cleaning.R. It includes cleaning messy variable titles, renaming mislabeled variables, and assigning NA to values that were determined to be inaccurately filled in. Some of the missing values replace erroneous entries: national populations below 1000, expenditures of 0, and infant deaths of 0 were replaced with NA, in addition to data that were missing from the start. Missing data are less informative, but they are preferable to misleading data. In total there are 4053 missing values (2563 before cleaning). The codebook details the exact number of missing values for each variable.
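The replacement rules described above can be sketched in base R. This is only an illustration on a made-up four-row data frame, not the contents of data_cleaning.R; the column names follow the summary table, and the values are invented.

```r
# Toy data standing in for the real file; every value here is made up.
life <- data.frame(
  country = c("A", "B", "C", "D"),
  population = c(34, 1200000, NA, 950),
  percentage_expenditure = c(0, 71.3, 364.9, 0),
  infant_deaths = c(62, 0, 1, 27)
)

# Missing values that were already present before cleaning.
na_before <- sum(is.na(life))

# Replace implausible entries with NA, following the three rules in the text.
# which() skips rows that are already NA, so they are left untouched.
life$population[which(life$population < 1000)] <- NA
life$percentage_expenditure[which(life$percentage_expenditure == 0)] <- NA
life$infant_deaths[which(life$infant_deaths == 0)] <- NA

# Missing values after cleaning, in total and per variable.
na_after <- sum(is.na(life))
missing_by_variable <- colSums(is.na(life))
```

Comparing `na_before` with `na_after` gives the same kind of before-and-after count reported above, and `missing_by_variable` is the per-variable breakdown a codebook would list.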

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   country = col_character(),
##   status = col_character()
## )
## See spec(...) for full column specifications.
##                                 vars    n        mean          sd     median
## country*                           1 2938         NaN          NA         NA
## year                               2 2938     2007.52        4.61    2008.00
## status*                            3 2938         NaN          NA         NA
## life_expectancy                    4 2928       69.22        9.52      72.10
## adult_mortality                    5 2928      164.80      124.29     144.00
## infant_deaths                      6 2090       42.60      137.94       9.00
## alcohol                            7 2744        4.60        4.05       3.75
## percentage_expenditure             8 2327      932.09     2192.97     155.20
## hepatitis_b                        9 2385       80.94       25.07      92.00
## measles                           10 2938     2419.59    11467.27      17.00
## bmi                               11 2904       38.32       20.04      43.50
## under_five_deaths                 12 2938       42.04      160.45       4.00
## polio                             13 2919       82.55       23.43      93.00
## total_expenditure                 14 2712        5.94        2.50       5.75
## diphtheria                        15 2919       82.32       23.72      93.00
## hiv_aids                          16 2938        1.74        5.08       0.10
## gdp                               17 2490     7483.16    14270.17    1766.95
## population                        18 2255 12928693.79 61411764.07 1435568.00
## thinness_10_to_19_years           19 2904        4.84        4.42       3.30
## thinness_5_to_9_years             20 2904        4.87        4.51       3.30
## income_composition_of_resources   21 2771        0.63        0.21       0.68
## schooling                         22 2775       11.99        3.36      12.30
##                                    trimmed        mad     min          max
## country*                               NaN         NA     Inf         -Inf
## year                               2007.52       5.93 2000.00 2.015000e+03
## status*                                NaN         NA     Inf         -Inf
## life_expectancy                      69.91       8.60   36.30 8.900000e+01
## adult_mortality                     150.51     112.68    1.00 7.230000e+02
## infant_deaths                        17.40      11.86    1.00 1.800000e+03
## alcohol                               4.23       4.81    0.01 1.787000e+01
## percentage_expenditure              361.18     213.65    0.10 1.947991e+04
## hepatitis_b                          86.89       8.90    1.00 9.900000e+01
## measles                             286.08      25.20    0.00 2.121830e+05
## bmi                                  39.05      24.17    1.00 8.730000e+01
## under_five_deaths                    14.15       5.93    0.00 2.500000e+03
## polio                                88.05       8.90    3.00 9.900000e+01
## total_expenditure                     5.85       2.36    0.37 1.760000e+01
## diphtheria                           87.99       8.90    2.00 9.900000e+01
## hiv_aids                              0.54       0.00    0.10 5.060000e+01
## gdp                                3751.73    2360.98    1.68 1.191727e+05
## population                      4051514.13 2078239.00 1141.00 1.293859e+09
## thinness_10_to_19_years               4.14       3.41    0.10 2.770000e+01
## thinness_5_to_9_years                 4.15       3.41    0.10 2.860000e+01
## income_composition_of_resources       0.65       0.19    0.00 9.500000e-01
## schooling                            12.17       3.11    0.00 2.070000e+01
##                                        range  skew kurtosis         se
## country*                                -Inf    NA       NA         NA
## year                            1.500000e+01 -0.01    -1.21       0.09
## status*                                 -Inf    NA       NA         NA
## life_expectancy                 5.270000e+01 -0.64    -0.24       0.18
## adult_mortality                 7.220000e+02  1.17     1.74       2.30
## infant_deaths                   1.799000e+03  8.32    83.16       3.02
## alcohol                         1.786000e+01  0.59    -0.81       0.08
## percentage_expenditure          1.947981e+04  4.11    20.65      45.46
## hepatitis_b                     9.800000e+01 -1.93     2.76       0.51
## measles                         2.121830e+05  9.43   114.58     211.56
## bmi                             8.630000e+01 -0.22    -1.29       0.37
## under_five_deaths               2.500000e+03  9.49   109.49       2.96
## polio                           9.600000e+01 -2.10     3.76       0.43
## total_expenditure               1.723000e+01  0.62     1.15       0.05
## diphtheria                      9.700000e+01 -2.07     3.55       0.44
## hiv_aids                        5.050000e+01  5.39    34.80       0.09
## gdp                             1.191711e+05  3.20    12.29     285.98
## population                      1.293858e+09 15.79   293.20 1293237.53
## thinness_10_to_19_years         2.760000e+01  1.71     3.96       0.08
## thinness_5_to_9_years           2.850000e+01  1.78     4.34       0.08
## income_composition_of_resources 9.500000e-01 -1.14     1.38       0.00
## schooling                       2.070000e+01 -0.60     0.88       0.06

This summary is a useful starting point for deciding what may be worth investigating. In particular, the n column shows how many non-missing values each variable has (and therefore how many of the 2938 rows had to be represented as NA), and the standard deviations show how widely each variable is spread.

Initial Analysis

Below I plot every variable against the target variable, life expectancy. This is a starting point for seeing which variables appear to predict life expectancy. My variables of interest are economic factors and alcohol consumption, so I will address those as well.
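One way to draw every predictor against life expectancy in a single figure is to stack the predictors into long format and facet by variable. The sketch below assumes ggplot2 and uses simulated data with three of the column names from the summary table; it is not the original plotting code. With fewer than 1000 rows `geom_smooth()` defaults to loess rather than the gam method reported in the messages for the full data.

```r
library(ggplot2)

# Simulated stand-in for the cleaned data; the relationships are invented.
set.seed(1)
n <- 200
life <- data.frame(
  schooling = runif(n, 0, 20),
  alcohol = runif(n, 0, 18),
  adult_mortality = runif(n, 1, 700)
)
life$life_expectancy <- 50 + 1.5 * life$schooling -
  0.02 * life$adult_mortality + rnorm(n, sd = 4)

predictors <- setdiff(names(life), "life_expectancy")

# Stack the predictors column by column: one row per (variable, value) pair.
long <- stack(life[predictors])
names(long) <- c("value", "variable")

# Repeat the target once per predictor so it lines up with the stacked rows.
long$life_expectancy <- rep(life$life_expectancy, times = length(predictors))

# One panel per predictor, each with its own x scale.
p_all <- ggplot(long, aes(x = value, y = life_expectancy)) +
  geom_point(alpha = 0.3, na.rm = TRUE) +
  geom_smooth(na.rm = TRUE) +
  facet_wrap(~ variable, scales = "free_x")
```

Printing `p_all` draws the panels; free x scales matter here because the predictors are measured in very different units.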

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'




A majority of these graphs indicate no apparent relationship. However, we can see a positive relationship with life expectancy for both schooling and income composition of resources, and a negative relationship with adult mortality.

We can see a very strong relationship between income composition of resources and life expectancy appear here (noting outliers). Using geom_bin2d allows for a more informative and descriptive visual: the highest density falls along the line, with a small spread shown by the darker blue outline. This is more informative than using geom_smooth because it shows both where the points are densest and how far they spread. What is interesting about this result is that income composition of resources attempts to describe economic prosperity at a more individual level, as opposed to nationwide economic growth. Because there is a relationship here and not one with GDP, this finding may be able to inform policy.
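A binned density plot of this kind can be sketched as follows. The data are simulated, the slope and noise level are invented, and the choice of 30 bins is an assumption rather than the setting used for the original figure.

```r
library(ggplot2)

# Simulated stand-in: a noisy linear relationship, invented for illustration.
set.seed(2)
n <- 500
icr <- runif(n, 0.3, 0.95)
life <- data.frame(
  income_composition_of_resources = icr,
  life_expectancy = 40 + 45 * icr + rnorm(n, sd = 3)
)

# geom_bin2d() counts points in rectangular bins and maps the count to fill,
# so the densest region and the spread around it are both visible.
p_bin <- ggplot(life, aes(income_composition_of_resources, life_expectancy)) +
  geom_bin2d(bins = 30, na.rm = TRUE)
```

Unlike a smoothed line, the filled bins keep the information about how many observations sit in each region, which is what makes outliers and the width of the band easy to read.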

To confirm the lack of a relationship between GDP and income composition of resources, I used geom_bin2d again to see how the two interact, and there is clearly no strong relationship between them. This result is notable because it shows that economic factors differ in how they relate to life expectancy, which can inform how national and international funds are allocated.



Income composition of resources also exhibits a negative correlation with adult mortality (noting that the density is highest along a straight line). Adult mortality measured against schooling shows a less dramatic relationship, though it is definitely negative.

Developed versus Developing and Economic Factors

The source of this data did not mention how developing versus developed was defined, though it does mention that the data is an assortment with components from the UN. Therefore, I will use the UN’s definition (“reflect basic economic country conditions”) to understand the data.

## # A tibble: 2 x 2
##   `status == "Developed"`     n
##   <lgl>                   <int>
## 1 FALSE                    2426
## 2 TRUE                      512

There is a very clear trend of developed countries having a higher life expectancy. This is not surprising, as developed is defined in an economic sense, meaning that these countries have the resources to allocate to public health and safety. I used a density plot with different colors and linetypes to make the difference clear and to account for the fact that far more observations are classified as developing than as developed (2426 versus 512). A density plot scales each group so that the area under its curve is 1, so we see how life expectancy is distributed within each group rather than raw counts, which would present a misleading picture of the distribution.
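The density comparison can be sketched like this, again on simulated data with deliberately unbalanced group sizes (400 versus 80, chosen only for illustration). The last two lines check numerically, in base R, that a kernel density estimate has an area of about 1 regardless of group size.

```r
library(ggplot2)

# Simulated stand-in with unbalanced groups; means and spreads are invented.
set.seed(4)
life <- data.frame(
  status = rep(c("Developing", "Developed"), times = c(400, 80)),
  life_expectancy = c(rnorm(400, mean = 67, sd = 9),
                      rnorm(80, mean = 79, sd = 4))
)
group_sizes <- table(life$status)

# Color and linetype both encode status, so the curves stay distinguishable
# even without color.
p_density <- ggplot(life, aes(x = life_expectancy,
                              color = status, linetype = status)) +
  geom_density(na.rm = TRUE)

# Trapezoid-rule area under the density estimate for the smaller group.
d <- density(life$life_expectancy[life$status == "Developed"])
area <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)
```

Because `area` is close to 1 for either group, the two curves can be compared by shape and position even though one group has five times as many rows in this example.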

As developed versus developing is defined in economic terms, let’s look at how economic variables interact.

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'



What is puzzling is that the relationship between GDP and percentage expenditure seems to be nearly linear. This would suggest that every country increases its allocation of resources to health at a similar rate. There is an obvious flaw in the actual numbers, as percentages cannot exceed 100. However, because more than 70% of the values for percentage expenditure exceed 100, the variable was more likely mislabeled by the author than filled in wrongly, and it may still be useful to read it as indicative of a trend.
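The share of values above 100 is a one-line calculation in base R. The vector below is a made-up example, not the real column, so the resulting share differs from the figure quoted above.

```r
# Made-up values standing in for the percentage_expenditure column.
percentage_expenditure <- c(0.1, 48.2, 155.2, 932.1, NA, 2120.7, 19479.9, 86.0)

# The mean of a logical vector is the proportion of TRUE values;
# na.rm = TRUE drops missing entries from both numerator and denominator.
share_over_100 <- mean(percentage_expenditure > 100, na.rm = TRUE)
```

Here four of the seven non-missing values exceed 100, so `share_over_100` is 4/7. A share this large is what points to a labeling problem rather than a handful of entry errors.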

Here a boxplot was used, even though GDP is continuous, in order to show the quartiles and outliers clearly. I used cut_width to separate GDP into discrete bins of equal width and set varwidth to TRUE so that the width of each box reflects how many observations fall in that bin. The plot shows a relatively steady increase in median life expectancy as GDP rises, as well as steady growth of the first quartile.
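A binned boxplot along these lines might look like the sketch below. The data are simulated, and the bin width of 10000 and the `boundary = 0` setting are assumptions chosen for the example, not the values used for the original figure.

```r
library(ggplot2)

# Simulated stand-in; the GDP range and the relationship are invented.
set.seed(3)
n <- 300
life <- data.frame(gdp = runif(n, 0, 50000))
life$life_expectancy <- 55 + 20 * life$gdp / 50000 + rnorm(n, sd = 5)

# cut_width() splits a continuous variable into bins of equal width.
gdp_bin <- cut_width(life$gdp, width = 10000, boundary = 0)
bin_counts <- table(gdp_bin)

# One box per GDP bin; varwidth = TRUE draws each box with a width
# proportional to the square root of the number of observations in it.
p_box <- ggplot(life, aes(x = gdp, y = life_expectancy)) +
  geom_boxplot(aes(group = cut_width(gdp, width = 10000, boundary = 0)),
               varwidth = TRUE)
```

Equal-width bins generally hold different numbers of observations (see `bin_counts`), which is why the variable box width is useful: a narrow box warns that its quartiles rest on few data points.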

## Coordinate system already present. Adding new coordinate system, which will replace the existing one.



After initially looking at the distribution of developed versus developing countries in terms of schooling using a color-coded scatter plot, I zoomed into the upper right corner, which revealed some outliers. I used geom_text to see which countries were notable. These countries (Germany, Canada, Finland, and France in particular) seem to be misclassified as developing. To test this, I looked at other economic metrics, since that is how status is defined according to the UN.
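Zooming and labeling can be sketched as follows. The country names and values are placeholders, and the window limits and the thresholds used to pick the labeled points are assumptions for the example.

```r
library(ggplot2)

# Placeholder rows; names and values are invented for illustration.
life <- data.frame(
  country = c("Country A", "Country B", "Country C", "Country D", "Country E"),
  status = c("Developing", "Developing", "Developed", "Developing", "Developed"),
  schooling = c(8.1, 17.2, 16.5, 11.4, 18.3),
  life_expectancy = c(58.0, 81.5, 80.2, 69.9, 82.7)
)

# Developing-status rows that fall inside the upper right window.
flagged <- subset(life,
                  status == "Developing" & schooling > 15 & life_expectancy > 78)

# coord_cartesian() zooms without dropping data outside the window;
# geom_text() labels only the flagged rows.
p_zoom <- ggplot(life, aes(schooling, life_expectancy, color = status)) +
  geom_point() +
  geom_text(data = flagged, aes(label = country),
            vjust = -0.8, show.legend = FALSE) +
  coord_cartesian(xlim = c(15, 21), ylim = c(78, 90))
```

Adding a second coordinate system to a plot that already has one triggers ggplot2's "Coordinate system already present" message; this sketch adds `coord_cartesian()` only once, so it stays silent.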

## Coordinate system already present. Adding new coordinate system, which will replace the existing one.



This graph shows the same outlying countries labeled as developing but with a high GDP as well, which could indicate a mistake in classification.

Effect of Alcohol

On a different note, another area I thought worth exploring is the effect of alcohol consumption on life expectancy.

This is an interesting plot: recorded alcohol consumption is far higher in developed countries, which may be counterintuitive because alcohol is generally associated with health detriments. I again used a density plot to compare developed and developing countries, to account for the imbalance in the number classified in each group. I will look into this with another metric.

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'



The plot of alcohol consumption against life expectancy does not reveal any meaningful trend, which suggests that alcohol consumption is not a useful predictor of life expectancy in this data.

Conclusion

Though this data set has many apparent errors in how values were reported, some useful conclusions can be drawn from this analysis. We can see how income composition of resources, GDP, and schooling interact with one another and which of them could serve as predictive variables for life expectancy. The relationship between GDP and life expectancy was not very telling, whereas income composition of resources and schooling each had a positive correlation with life expectancy. When compared to one another, schooling and income composition of resources also showed a strong positive correlation, with the highest density along a straight line and the points spread along that line. Overall, GDP demonstrated less utility than the other economic variables, which pointed in a more promising direction.